This lab focuses on $K$-means clustering using the Iris flower data set. At the end of the lab, you should be able to load the Iris data set from scikit-learn, build $K$-means clustering models with the KMeans class, visualise the resulting clusters, and use the inertia curve to choose a sensible number of clusters.
Let's start by importing the packages we'll need. As usual, we'll import pandas for exploratory analysis, but this week we're also going to use the cluster subpackage from scikit-learn to create $K$-means models and the datasets subpackage to access the Iris data set.
In [ ]:
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import cluster
from sklearn import datasets
Next, let's load the data. The Iris data set is included in scikit-learn's datasets subpackage, so we can just load it directly like this:
In [ ]:
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)  # Convert the raw data to a data frame
X.head()
In [ ]:
pd.plotting.scatter_matrix(X, c=iris.target, figsize=(9, 9));
The colours of the data points here are our ground truth, that is, the actual class labels of the data. Generally, when we cluster data, we don't know the ground truth, but in this instance it will help us to assess how well $K$-means clustering segments the data into its true categories.
Let's build a $K$-means clustering model of the iris data. scikit-learn supports $K$-means clustering via the cluster subpackage, where the KMeans class is used to build models.
Generally, we won't know in advance how many clusters to use but, in this instance, we do, so let's start by splitting the data into three clusters. We can specify n_clusters=3 to find three clusters, like this:
In [ ]:
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X)
Note: In previous weeks, we have called fit(X, y) when fitting scikit-learn estimators. However, in each of these cases, we were fitting supervised learning models where y represented the true class labels of the data. This week, we're fitting $K$-means clustering models, which are unsupervised learners, and so there is no need to specify the true class labels (i.e. y).
When we call the predict method on our fitted estimator, it predicts the cluster label for each record in our explanatory data matrix (i.e. X):
In [ ]:
labels = k_means.predict(X)
print(labels)
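As an aside, KMeans also provides a fit_predict method that fits the model and returns the cluster labels in a single call. The sketch below is for illustration only and uses a separate variable, labels_alt, so that the labels computed above are left untouched for the plots that follow; because $K$-means starts from random initial centres, the arbitrary cluster numbering may differ between runs.
In [ ]:
# Illustrative only: fit a fresh three-cluster model and label the data in one step
labels_alt = cluster.KMeans(n_clusters=3).fit_predict(X)
print(labels_alt)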
We can check the results of our clustering visually by building another scatter plot matrix, this time colouring the points according to the cluster labels:
In [ ]:
pd.plotting.scatter_matrix(X, c=labels, figsize=(9, 9));
As can be seen, the $K$-means algorithm has partitioned the data into three distinct sets, using just the values of petal length, petal width, sepal length and sepal width. The clusters do not precisely correspond to the true class labels plotted earlier but, as we usually perform clustering in situations where we don't know the true class labels, this seems like a reasonable attempt.
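Because we happen to know the true species here, we can also make this comparison concrete by cross-tabulating the true class labels against the cluster labels; this is a minimal sketch for illustration only, since no such table would be available in a genuine clustering problem. Note that the cluster numbers are arbitrary, so what matters is whether each true class falls mostly into a single cluster, not whether the numbers match.
In [ ]:
# Rows are the true classes, columns are the K-means cluster labels
pd.crosstab(iris.target, labels, rownames=["True class"], colnames=["Cluster"])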
We can cluster the data into arbitrarily many clusters (up to the point where each sample is its own cluster). Let's cluster the data into two clusters and see what effect this has:
In [ ]:
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(X)
labels = k_means.predict(X)
pd.plotting.scatter_matrix(X, c=labels, figsize=(9, 9));
One way to find the optimum number of clusters is to plot the variation in total inertia (the within-cluster sum of squared distances from each sample to its cluster centre) with increasing numbers of clusters. Because the total inertia decreases as the number of clusters increases, we can determine a reasonable, but possibly not true, clustering of the data by finding the "elbow" in the curve, which occurs as a result of the diminishing returns from adding further clusters.
We can access the inertia value of a fitted $K$-means model using its inertia_ attribute, like this:
In [ ]:
clusters = range(1, 10)
inertia = []

for n in clusters:
    k_means = cluster.KMeans(n_clusters=n)
    k_means.fit(X)
    inertia.append(k_means.inertia_)

plt.plot(clusters, inertia)
plt.xlabel("Number of clusters")
plt.ylabel("Inertia");
In this instance, we could choose either two or three clusters to represent the data, as these represent the largest decreases in inertia. As we know that there are three true classes, choosing two would be an incorrect conclusion in this case, but this kind of mistake is an unavoidable risk of clustering. If we do not know the structure of the data in advance, we always risk choosing a representation of it that does not reflect the ground truth.
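One way to quantify this (sketched below, and only possible here because we happen to know the ground truth) is scikit-learn's adjusted Rand index, which measures the agreement between two labellings while ignoring the arbitrary numbering of the clusters; comparing the scores for two and three clusters lets us check which choice agrees more closely with the true species.
In [ ]:
from sklearn import metrics

# Compare each clustering against the known species labels (1.0 = perfect agreement)
for n in [2, 3]:
    labels_n = cluster.KMeans(n_clusters=n).fit_predict(X)
    print(n, metrics.adjusted_rand_score(iris.target, labels_n))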